
Introduction

Welcome to my Data Science project on CHURN PREDICTION for bank customers.

A real dataset with characteristic and behavioral features for 10,000 customers of a bank is explored, processed, and used to train a Machine Learning model that predicts churn probability. Findings from the data exploration and the model interpretation are turned into actionable information the bank can use to reduce churn, and an optimized probability cutoff that minimizes the bank's costs is estimated for the classification model. The following sections will walk you through all of the above.

I also invite you to visit my LinkedIn profile and to see my other projects on my GitHub profile.

Sincerely,

Michail Mavrogiannis


Data Overview

Dataset References & Import

The dataset used as input to the present project was obtained from the "Churn for Bank Customers" post, created by user Mehmet A. on kaggle.com. The dataset was published on kaggle.com under the "CC0: Public Domain" license. The dataset column names and their order have been modified for easier reference.
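A rename of this kind can be done with pandas DataFrame.rename. The sketch below is illustrative only; the original Kaggle column names shown ('Geography', 'Exited') are assumptions, not taken from this project's files:

```python
import pandas as pd

# Illustrative frame using (assumed) original Kaggle column names
raw = pd.DataFrame({'Geography': ['France', 'Spain'], 'Exited': [1, 0]})

# Map the assumed original names to the names used throughout this project
renamed = raw.rename(columns={'Geography': 'country', 'Exited': 'churn'})
```

Note that rename returns a new dataframe and leaves the original untouched unless inplace = True is passed.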

Import libraries below:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats

Import dataset into dataframe "dfr".

In [2]:
dfr = pd.read_csv('C:/Users/Michael/Desktop/Data Science- MM/Churn/Data.csv')
dfr.head()
Out[2]:
surname customer_id country age gender balance num_products tenure credit_card active_member credit_score salary churn
0 Hargrave 15634602 France 42 Female 0.00 1 2 1 1 619 101348.88 1
1 Hill 15647311 Spain 41 Female 83807.86 1 1 0 1 608 112542.58 0
2 Onio 15619304 France 42 Female 159660.80 3 8 1 0 502 113931.57 1
3 Boni 15701354 France 39 Female 0.00 2 1 0 0 699 93826.63 0
4 Mitchell 15737888 Spain 43 Female 125510.82 1 2 1 1 850 79084.10 0

Description of Features:

  • surname: Customer surname.
  • customer_id: Unique ID assigned by the bank.
  • country: Country of residence.
  • age: Customer age.
  • gender: Customer gender.
  • balance: Customer account balance.
  • num_products: Number of products the customer has purchased with the bank.
  • tenure: Time period in years, for which the customer has been with the bank.
  • credit_card: Whether a customer has a credit card; 1: yes, 0: no.
  • active_member: Whether the customer is active (based on criteria defined by the bank); 1: yes, 0: no.
  • credit_score: Credit score in the range 350 - 850.
  • salary: Customer estimated salary by the bank.
  • churn: Whether the customer left the bank; 1: yes, 0: no.

Column 'surname' will be omitted, since it does not provide information useful for predicting customer churn probability:

In [3]:
dfr.drop('surname', axis = 1, inplace = True)

Summary Statistics

In [4]:
dfr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    10000 non-null  int64  
 1   country        10000 non-null  object 
 2   age            10000 non-null  int64  
 3   gender         10000 non-null  object 
 4   balance        10000 non-null  float64
 5   num_products   10000 non-null  int64  
 6   tenure         10000 non-null  int64  
 7   credit_card    10000 non-null  int64  
 8   active_member  10000 non-null  int64  
 9   credit_score   10000 non-null  int64  
 10  salary         10000 non-null  float64
 11  churn          10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB
In [5]:
dfr.drop('customer_id', axis = 1).describe()
Out[5]:
age balance num_products tenure credit_card active_member credit_score salary churn
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.00000 10000.000000 10000.000000 10000.000000 10000.000000
mean 38.921800 76485.889288 1.530200 5.012800 0.70550 0.515100 650.528800 100090.239881 0.203700
std 10.487806 62397.405202 0.581654 2.892174 0.45584 0.499797 96.653299 57510.492818 0.402769
min 18.000000 0.000000 1.000000 0.000000 0.00000 0.000000 350.000000 11.580000 0.000000
25% 32.000000 0.000000 1.000000 3.000000 0.00000 0.000000 584.000000 51002.110000 0.000000
50% 37.000000 97198.540000 1.000000 5.000000 1.00000 1.000000 652.000000 100193.915000 0.000000
75% 44.000000 127644.240000 2.000000 7.000000 1.00000 1.000000 718.000000 149388.247500 0.000000
max 92.000000 250898.090000 4.000000 10.000000 1.00000 1.000000 850.000000 199992.480000 1.000000

Missing Values, Unreasonable Values, Outliers

The dataset does not contain missing values, as shown above.

In the general case where a dataset contains missing data, the frequency (how many rows/columns are affected) and the reason the data are missing (completely at random, at random, not at random) should be examined. Depending on the case, null values can be omitted, imputed using summary statistics (e.g. mean, mode), or estimated through a Machine Learning model.
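A minimal sketch of the first two options, on a toy frame with invented values (the project's dataset itself has no nulls):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values, for illustration only
toy = pd.DataFrame({'age': [30.0, np.nan, 45.0, 38.0],
                    'country': ['France', 'Spain', None, 'France']})

# Option 1: omit the incomplete rows
dropped = toy.dropna()

# Option 2: impute the numeric column with its mean, the categorical with its mode
imputed = toy.copy()
imputed['age'] = imputed['age'].fillna(imputed['age'].mean())
imputed['country'] = imputed['country'].fillna(imputed['country'].mode()[0])
```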

In the following sections all features will be explored, and any unreasonable values or outliers will be identified.

Data Exploration

This section presents an exploratory analysis of the response variable (churn) and the customers' characteristic and behavioral features. The distribution of each feature as well as the variance of the churn probability per the feature values are obtained and evaluated through analytics, graphs, and decision trees as required. Preliminary conclusions are reached about the relationship between the features and the churn probability.

Response Variable: churn

In [6]:
dfr['churn'].value_counts(normalize = True)
Out[6]:
0    0.7963
1    0.2037
Name: churn, dtype: float64

The overall churn rate is approximately 20%, which is within the expected range for retail banks.

In [7]:
sns.countplot(data = dfr, x = 'churn')
plt.title('Churn Histogram'); plt.ylabel('# of customers'); plt.show()

customer_id

In [8]:
dfr['customer_id'].nunique(), len(dfr)
Out[8]:
(10000, 10000)

It is confirmed that the dataset contains unique customers only.

country

This section explores the country of residence of the bank customers and provides a per-country summary of the total number of customers, their portion of the entire dataset, the numbers of customers who remained with or left the bank, and the churn probability. This information is provided:

  • analytically through a summary dataframe, and
  • visually through graphs.

In the next sections for the rest of the features, the summary dataframe will be omitted and the above information will be provided visually.

In [9]:
country = pd.DataFrame()
country['Customers'] = dfr.groupby('country').count()['customer_id']
country['Customer Portion'] = (country['Customers'] / len(dfr)).round(2)
country['Remained'] = dfr[dfr['churn'] == 0].groupby('country').count()['customer_id']
country['Churned'] = dfr[dfr['churn'] == 1].groupby('country').count()['customer_id']
country['Churn Probability'] = dfr.groupby('country')['churn'].mean().round(2)
country
Out[9]:
Customers Customer Portion Remained Churned Churn Probability
country
France 5014 0.50 4204 810 0.16
Germany 2509 0.25 1695 814 0.32
Spain 2477 0.25 2064 413 0.17
In [10]:
sns.countplot(data = dfr, x = 'country', color = 'lightblue', order = ['France', 'Germany', 'Spain'])
plt.title('Country Histogram'); plt.ylabel('# of customers'); plt.show()
In [11]:
sns.countplot(data = dfr, x = 'country', hue = 'churn', order = ['France', 'Germany', 'Spain'])
plt.title('Churn Histogram per Country'); plt.ylabel('# of customers'); plt.show()
In [12]:
sns.barplot(data = dfr, x = 'country', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen', 
            order = ['France', 'Germany', 'Spain'])
plt.ylabel('churn probability'); plt.title('Churn Probability per Country'); plt.show()

It is observed that:

  • Germany accounts for a significant portion of the customers, approximately 25%, and
  • has almost double the churn probability of the other countries.

age

In [13]:
dfr['age'].min(), dfr['age'].max()
Out[13]:
(18, 92)
In [14]:
plt.figure(figsize = (18,5))
sns.countplot(data = dfr, x = 'age', color = 'lightblue', order = range(18, 93))
plt.title('Age Histogram'); plt.ylabel('# of customers'); plt.show()
In [15]:
plt.figure(figsize = (18,5))
sns.countplot(data = dfr, x = 'age', hue = 'churn', order = range(18, 93))
plt.title('Churn Histogram per Age'); plt.ylabel('# of customers'); plt.show()

The middle 80% of the age observations fall within the interval:

In [16]:
stats.scoreatpercentile(dfr['age'], 10), stats.scoreatpercentile(dfr['age'], 90)
Out[16]:
(27.0, 53.0)
In [17]:
plt.figure(figsize = (18,5))
sns.barplot(data = dfr, x = 'age', y = 'churn', order = range(18, 93), estimator = np.mean, ci = None, color = 'lightgreen')

# Plot vertical lines to delimit the middle 80% of the observations:
age_list = list(dfr['age'].sort_values().unique())
plt.axvline(x = age_list.index(stats.scoreatpercentile(dfr['age'], 10)), color = 'blue', linestyle = '--', 
            label = 'Middle 80% \n Limits')
plt.axvline(x = age_list.index(stats.scoreatpercentile(dfr['age'], 90)), color = 'blue', linestyle = '--')

plt.ylabel('churn probability'); plt.title('Churn Probability per Age'); plt.legend(); plt.show()

It is observed that:

  • For the middle 80% of the observations, i.e. for ages from 27 to 53 years,
  • the churn probability increases almost exponentially with age.

gender

In [18]:
sns.countplot(data = dfr, x = 'gender', color = 'lightblue')
plt.title('Gender Histogram'); plt.ylabel('# of customers'); plt.show()
In [19]:
sns.countplot(data = dfr, x = 'gender', hue = 'churn')
plt.title('Churn Histogram per Gender'); plt.ylabel('# of customers'); plt.show()
In [20]:
sns.barplot(data = dfr, x = 'gender', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Gender'); plt.show()

It is observed that:

  • Women represent a little less than 50% of the customers, and
  • have a considerably higher churn probability than men.

balance

Balances are categorized into bins of width 10,000. Based on the customer countries, Euro (€) currency is assumed.

In [21]:
balance_bins = np.linspace(-0.1, 260000, 27)
balance_labels = [f"[{j}k, {j+10}k)" for j in range(0, 260, 10)]
dfr['balance_bin'] = pd.cut(x = dfr['balance'], bins = balance_bins, labels = balance_labels)
dfr[['balance','balance_bin']].head(3)          
Out[21]:
balance balance_bin
0 0.00 [0k, 10k)
1 83807.86 [80k, 90k)
2 159660.80 [150k, 160k)
In [22]:
plt.figure(figsize = (14,5))
sns.countplot(data = dfr, x = 'balance_bin', color = 'lightblue')
plt.title('Balance Histogram'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.ylabel('# of customers') 
plt.show()
In [23]:
len(dfr[(dfr['balance'] > 0) & (dfr['balance'] <= 10000)])
Out[23]:
1

The above histogram of balances does not have the expected form of a highly right-skewed distribution or even a distribution close to a half-normal.

In addition to this, almost 1/3 of observations have a zero balance. This is further investigated in the next section.

In [24]:
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'balance_bin', hue = 'churn')
plt.title('Churn Histogram per Balance'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.ylabel('# of customers')
plt.legend(loc = ('upper right'), title = 'churn'); plt.show()

Excluding the zero-balance observations, the middle 80% of the remaining observations fall within the interval:

In [25]:
print(stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 10).round(0), 
      stats.scoreatpercentile( dfr[dfr['balance'] != 0]['balance'], 90).round(0))
81552.0 157756.0
In [26]:
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'balance_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)

# Plot vertical lines to delimit the middle 80% of the observations excluding zero balances:
balance_list = list(dfr['balance_bin'].sort_values().unique())
plt.axvline(x = balance_list.index(pd.cut(x = [stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 10)], 
    bins = balance_bins, labels = balance_labels)[0]),
    color = 'blue', linestyle = '--', label = 'Middle 80% Limits, \n Excluding 0 Balances')
plt.axvline(x = balance_list.index(pd.cut(x = [stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 90)], 
    bins = balance_bins, labels = balance_labels)[0]),
    color = 'blue', linestyle = '--')

plt.ylabel('churn probability'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.legend()
plt.title('Churn Probability per Balance'); plt.show()

It is observed that:

  • The balance distribution does not have an expected shape, and almost 1/3 of the customers have a zero balance. This is further investigated in the next section.
  • The middle 80% of the non-zero observations (which constitute almost 2/3 of the customers) do not show a large variation in churn rate.
  • Also, customers with a non-zero balance have a considerably higher churn rate than customers with a zero balance (1/3 of the customers).
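The last comparison can be obtained directly with a groupby on the zero/non-zero split. The snippet below shows the pattern on a small synthetic frame with invented values that mimics the project's column names:

```python
import pandas as pd

# Synthetic stand-in for the project's dataframe (values are illustrative)
df = pd.DataFrame({'balance': [0.0, 0.0, 0.0, 125000.0, 83000.0, 160000.0],
                   'churn':   [0,   0,   1,   1,        0,       1]})

# Mean churn rate per balance group (True = non-zero balance)
churn_by_balance = df.groupby(df['balance'] > 0)['churn'].mean()
```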

balance: Further Exploration

In [27]:
len(dfr[dfr['balance'] == 0]), len(dfr[dfr['balance'] == 0]) / len(dfr)
Out[27]:
(3617, 0.3617)

In order to investigate the zero balances, the balance feature will be further "sliced" by the other features of the dataset.

A decision tree is used to find the features whose importance stands out when determining whether a customer has a zero balance or not. To obtain these features a "min_impurity_decrease" limit is used, determined after trial-and-error. The training set of the model includes the predictors: country (dummies), age, gender (dummies), num_products, tenure, credit_card, active_member, credit_score, salary. The response variable is whether a user has zero balance or not (a new column is added to the dataframe for that). No train/test set split is required here.

In [28]:
dfr0 = pd.concat(
       [dfr[['age', 'num_products', 'tenure', 'credit_card', 'active_member', 'credit_score', 'salary']], 
        pd.get_dummies(dfr[['gender', 'country']])], 
        axis = 1)
dfr['zero_balance'] = dfr['balance'].apply(lambda x: 1 if x == 0 else 0)
In [29]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
mdl0 = DecisionTreeClassifier(min_impurity_decrease = 0.01)
mdl0.fit(dfr0, dfr['zero_balance'])
plt.figure(figsize=(15,12)); plot_tree(mdl0, max_depth = 20, fontsize = 20); plt.show()
In [30]:
dfr0.columns[[10, 1]]
Out[30]:
Index(['country_Germany', 'num_products'], dtype='object')

The features of outstanding importance for zero balance are 'Germany' and 'num_products'.

For Germany:

In [31]:
sns.countplot(data = dfr, x = 'country', hue = 'zero_balance')
plt.title('Distribution of Zero Balances per Country'); plt.ylabel('# of customers'); 
plt.legend(loc = (1.01, 0.78), title = 'zero_balance'); plt.show()

Surprisingly, France and Spain have almost 50% of their customers with zero balances, whereas Germany has no customers with zero balance.

For number of products:

In [32]:
sns.countplot(data = dfr, x = 'num_products', hue = 'zero_balance')
plt.title('Distribution of Zero Balances per # of Products'); plt.ylabel('# of customers'); 
plt.legend(loc = (1.01, 0.78), title = 'zero_balance'); plt.show()

Customers with 2 products present a remarkably higher probability of having a zero balance. This could be because their main product is different from a checking or savings account, so they merely also maintain a zero-balance account.

From the above exploration:

  • Although for this project there is no additional data available, in real conditions the data scientists should request more information from the bank to investigate the zero balances. Customers with a zero balance may fall into one of the following categories: a) they have only one product (an account) with a zero balance, b) they do not have an account product at all but are reported as zero-balanced, c) they have a zero-balance account but also other products. If additional features with this information were available, we could further "slice" the customers and shed more light on the intent to churn.
  • Customers with a zero balance constitute almost 50% of the customers in France and Spain, but they are absent in Germany. It would be good to know the reason; it may be that in Germany you cannot maintain a zero account, or cannot have a third product with the bank without also having an account with some balance.
  • Customers with two products are more likely to have a zero balance. They may fall into category (c) described above, i.e. they have a zero-balanced account but also other products.
  • It would be useful to know whether this dataset was intended to be representative of all the bank's customers or only of a special subset of them, and whether there was a bug in the algorithm used by the bank to select the customer sample for this dataset.
  • Given the absence of additional data and the purpose of this project, the analyses in the next sections will be conducted with the dataset on hand.

num_products

In [33]:
sns.countplot(data = dfr, x = 'num_products', color = 'lightblue')
plt.title('Number of Products Histogram'); plt.ylabel('# of customers'); plt.show()
In [34]:
sns.countplot(data = dfr, x = 'num_products', hue = 'churn')
plt.title('Churn Histogram per Number of Products'); plt.ylabel('# of customers'); 
plt.legend(loc = ('upper right'), title = 'churn'); plt.show()
In [35]:
sns.barplot(data = dfr, x = 'num_products', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen')
plt.ylabel('churn probability'); plt.title('Churn Probability per Number of Products'); plt.show()

As seen above:

  • The vast majority of customers have either 1 or 2 products, and the portions of customers in these two categories are about equal.
  • Customers with 1 product have almost three times the churn probability of customers with 2 products. Therefore, going from 1 product to 2 products seems to be the "turning point" for customer retention.

tenure

In [36]:
sns.countplot(data = dfr, x = 'tenure', color = 'lightblue')
plt.title('Tenure Histogram'); plt.xlabel('tenure in years'); plt.ylabel('# of customers'); plt.show()
In [37]:
sns.countplot(data = dfr, x = 'tenure', hue = 'churn')
plt.title('Churn Histogram per Tenure'); plt.xlabel('tenure in years'); plt.ylabel('# of customers'); 
plt.legend(loc = (1.01, 0.78), title = 'churn'); plt.show()
In [38]:
sns.barplot(data = dfr, x = 'tenure', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen')
plt.xlabel('tenure in years'); plt.ylabel('churn probability'); plt.title('Churn Probability per Tenure'); plt.show()

It is observed that:

  • Customer tenure is distributed almost uniformly, except for the 0 and 10 years, and
  • churn probability neither varies significantly nor shows a clear trend with tenure.
  • Since tenure is limited to 10 years, it seems that the dataset is not representative of all the customers of the bank (assuming the bank is older than 10 years).

credit_card

In [39]:
sns.countplot(data = dfr, x = 'credit_card', color = 'lightblue')
plt.title('Credit Card Possession Histogram'); plt.ylabel('# of customers'); plt.show()
In [40]:
sns.countplot(data = dfr, x = 'credit_card', hue = 'churn')
plt.title('Churn Histogram per Credit Card Possession'); plt.ylabel('# of customers'); plt.show()
In [41]:
sns.barplot(data = dfr, x = 'credit_card', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Credit Card Possession'); plt.show()

It is observed that:

  • Churn probability does not vary significantly with credit card possession.

active_member

In [42]:
sns.countplot(data = dfr, x = 'active_member', color = 'lightblue')
plt.title('Degree-of-Activity Histogram'); plt.ylabel('# of customers'); plt.show()
In [43]:
sns.countplot(data = dfr, x = 'active_member', hue = 'churn')
plt.title('Churn Histogram per Degree-of-Activity'); plt.ylabel('# of customers'); plt.show()
In [44]:
sns.barplot(data = dfr, x = 'active_member', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Degree-of-Activity'); plt.show()

It is observed that:

  • Active and non-active customers are represented almost equally in the dataset, and
  • inactive customers have an almost double probability of churn than active ones.

credit_score

Credit scores are categorized into bins of width 20 points.

In [45]:
score_bins = np.linspace(349.9, 850, 26)
score_labels = [f"[{j}, {j+20})" for j in range(350, 850, 20)]
dfr['score_bin'] = pd.cut(x = dfr['credit_score'], bins = score_bins, labels = score_labels)
dfr[['credit_score','score_bin']].head(3)          
Out[45]:
credit_score score_bin
0 619 [610, 630)
1 608 [590, 610)
2 502 [490, 510)
In [46]:
plt.figure(figsize = (14,5))
sns.countplot(data = dfr, x = 'score_bin', color = 'lightblue')
plt.title('Credit Score Histogram'); plt.xlabel('credit_score'); plt.xticks(rotation = 70); plt.ylabel('# of customers')
plt.show()
In [47]:
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'score_bin', hue = 'churn')
plt.title('Churn Histogram per Credit Score'); plt.xlabel('credit_score'); plt.xticks(rotation = 70)
plt.ylabel('# of customers'); plt.show()

The middle 80% of the credit_score observations fall within the interval below. The second mode of the distribution at the right extreme is not very important in terms of the number of observations, so the behavior of the middle 80% of the observations is still meaningful.

In [48]:
stats.scoreatpercentile(dfr['credit_score'], 10), stats.scoreatpercentile(dfr['credit_score'], 90)
Out[48]:
(521.0, 778.0)
In [49]:
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'score_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)

# Plot vertical lines to delimit the middle 80% of the observations:
score_list = list(dfr['score_bin'].sort_values().unique())
plt.axvline(x = 
score_list.index(pd.cut(x= [stats.scoreatpercentile(dfr['credit_score'], 10)], bins= score_bins, labels = score_labels)[0]),
color = 'blue', linestyle = '--', label = 'Middle 80% \n Limits')
plt.axvline(x = 
score_list.index(pd.cut(x= [stats.scoreatpercentile(dfr['credit_score'], 90)], bins= score_bins, labels = score_labels)[0]),
color = 'blue', linestyle = '--')

plt.xlabel('credit_score'); plt.ylabel('churn probability'); plt.title('Churn Probability per Credit Score'); 
plt.xticks(rotation = 70); plt.legend(); plt.show()

It is observed that:

  • For the middle 80% of the observations, i.e. for credit scores between 521 and 778,
  • the churn probability does not vary significantly or monotonically with credit score.
  • The high churn probability of 100% for credit scores smaller than 410 pertains to a negligibly small portion of the bank customers.

salary

Salaries are categorized into bins of width 10,000 units. Based on the customer countries, Euro (€) currency is assumed.

In [50]:
salary_bins = np.linspace(0, 200000, 21)
salary_labels = [f"[{j}k, {j+10}k)" for j in range(0, 200, 10)]
dfr['salary_bin'] = pd.cut(x = dfr['salary'], bins = salary_bins, labels = salary_labels)
dfr[['salary','salary_bin']].head(3)          
Out[50]:
salary salary_bin
0 101348.88 [100k, 110k)
1 112542.58 [110k, 120k)
2 113931.57 [110k, 120k)
In [51]:
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'salary_bin', color = 'lightblue')
plt.title('Salary Histogram'); plt.xticks(rotation = 70); plt.xlabel('salary [€]'); plt.ylabel('# of customers')
plt.show()
In [52]:
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'salary_bin', hue = 'churn')
plt.title('Churn Histogram per Salary'); plt.xticks(rotation = 70); plt.xlabel('salary [€]'); plt.ylabel('# of customers')
plt.legend(loc = (1.01, 0.83), title = 'churn'); plt.show()
In [53]:
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'salary_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)
plt.title('Churn Probability per Salary'); plt.xticks(rotation = 70); plt.xlabel('salary [€]'); 
plt.ylabel('churn probability'); plt.show()

As seen above:

  • The distribution of salary appears almost uniform, which is not compatible with the right-skewed distribution one would expect.
  • Although for the present project there is no more information available, in real conditions further research should be performed to verify the origin and purpose of the dataset: whether the data was intended to be representative of all the bank's customers or of just a special subgroup, and whether there was a bug in the mechanism that selected the customers for the dataset. Since the salaries were estimated by the bank (see the feature descriptions), the accuracy of the bank's estimation algorithm should also be checked.
  • Keeping the above caveats in mind, there is no consistent trend of churn probability with salary.

Machine Learning Model

The dataset will be used to train a Machine Learning model for predicting churn probability for the bank customers. It is assumed that the dataset on hand is representative of the bank customers, or of a subset of interest.

A Random Forest model is selected because it handles nonlinear relationships between the predictors (features) and the response variable, works with both numerical and categorical variables, is robust against outliers, and does not require normalization of the features. Feature importances and partial dependence plots will be used for interpreting the model.

Data Preparation

Categorical variables country and gender are converted to dummies. For country, the "drop_first" option is not used, so that all countries appear explicitly in the feature importance table. The fact that the country dummies are correlated has a negligible effect on the model.

In [54]:
country_dum = pd.get_dummies(dfr['country'])
gender_dum = pd.get_dummies(dfr['gender'], drop_first = True)
dfr.columns
Out[54]:
Index(['customer_id', 'country', 'age', 'gender', 'balance', 'num_products',
       'tenure', 'credit_card', 'active_member', 'credit_score', 'salary',
       'churn', 'balance_bin', 'zero_balance', 'score_bin', 'salary_bin'],
      dtype='object')
In [55]:
X = pd.concat([dfr[['age', 'balance', 'num_products', 'tenure', 'credit_card', 'active_member', 'credit_score', 'salary']],
              country_dum, gender_dum], axis = 1)
X.head(2)
Out[55]:
age balance num_products tenure credit_card active_member credit_score salary France Germany Spain Male
0 42 0.00 1 2 1 1 619 101348.88 1 0 0 0
1 41 83807.86 1 1 0 1 608 112542.58 0 0 1 0
In [56]:
y = dfr['churn']
y.head(2)
Out[56]:
0    1
1    0
Name: churn, dtype: int64

Train and test sets are created below:

In [57]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

Feature Correlation

In [58]:
plt.figure(figsize = (10,8))
sns.heatmap(X.corr(method = 'pearson').round(2), annot = True)
plt.title('Pearson Correlation of Features'); plt.show()

Features are not highly correlated in general. Please refer to the previous section regarding the correlation between countries.

Hyperparameter Tuning

The Random Forest hyperparameters selected for tuning are: the number of decision trees (to avoid high variance), the maximum depth of the trees (to avoid overfitting), and the class weight (to penalize incorrectly classified examples of the minority class, since the classes are imbalanced). A relatively limited tuning is performed here for the sake of simplicity; in practice, tuning would include more hyperparameters and more candidate values for each.

In [59]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator = RandomForestClassifier(), 
                     param_grid = {'n_estimators' : [50, 200, 400], 
                        'max_depth': [4, 6, 8],
                        'class_weight': [{0: 1, 1: 0.8}, {0: 1, 1: 1.0}, {0: 1, 1: 1.2}, {0: 1, 1: 1.5}]},
                     verbose = 1, n_jobs = -1)  
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_.round(3)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   18.5s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  1.3min finished
Out[59]:
({'class_weight': {0: 1, 1: 1.5}, 'max_depth': 8, 'n_estimators': 400}, 0.861)

The "grid.best_score_" attribute found above is the mean cross-validated score (accuracy) of the model with the optimal hyperparameters.
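To illustrate what best_score_ reports, the sketch below reproduces it with cross_val_score on a small synthetic problem (not the project data), using the same fold scheme and random seed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Small synthetic classification problem standing in for the training set
Xs, ys = make_classification(n_samples=300, n_features=5, random_state=0)

grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                    param_grid={'max_depth': [2, 4]}, cv=5)
grid.fit(Xs, ys)

# best_score_ equals the mean 5-fold CV accuracy of the best parameter setting
cv_mean = cross_val_score(grid.best_estimator_, Xs, ys, cv=5).mean()
```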

Model Training

A Random Forest Classifier is now trained based on the hyperparameters selected above.

In [60]:
from sklearn.ensemble import RandomForestClassifier
mdl = RandomForestClassifier(n_estimators = grid.best_params_['n_estimators'], 
                             max_depth = grid.best_params_['max_depth'], 
                             class_weight = grid.best_params_['class_weight'], oob_score = True, n_jobs = -1)                            
mdl.fit(X_train, y_train)
y_pred = mdl.predict(X_test)
y_prob = mdl.predict_proba(X_test)
from sklearn.metrics import classification_report, confusion_matrix, \
                            f1_score, precision_score, recall_score, accuracy_score
c_m = pd.DataFrame(data = confusion_matrix(y_test, y_pred)); c_m.index.names = ['Actual']; c_m.columns.names = ['Predicted']
print(classification_report(y_test, y_pred)); c_m
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      2373
           1       0.77      0.49      0.60       627

    accuracy                           0.86      3000
   macro avg       0.83      0.73      0.76      3000
weighted avg       0.86      0.86      0.85      3000

Out[60]:
Predicted 0 1
Actual
0 2284 89
1 321 306
In [61]:
print('The out-of-bag accuracy is {}.'. format(mdl.oob_score_.round(3)))
The out-of-bag accuracy is 0.863.
In [62]:
print('The training set accuracy is {}.'.format(accuracy_score(y_train, mdl.predict(X_train)).round(3)))
The training set accuracy is 0.885.
In [63]:
print('The test set accuracy is {}.'.format(accuracy_score(y_test, y_pred).round(3)))
The test set accuracy is 0.863.

It is observed that:

  • The out-of-bag accuracy obtained during model training and the training set accuracy are comparable to the test set accuracy, which means that the model does not overfit.
  • The precision, recall, and overall accuracy are satisfactory for label "0" (no churn). Precision and recall for label "1" (churn) are not excellent. This could be due to the small number of features and training examples available in this dataset. Other features that would also shed light on customer retention are: customer marital status, occupation, the types of products they hold with the bank, whether any of their requests were rejected by the bank (e.g. a credit card application), whether they have filed complaints with the bank, and others.
  • Improving recall for label "1" would reduce the precision for label "1", due to the precision/recall trade-off. A business-oriented approach to selecting an optimal precision/recall combination is presented in a later section.
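The trade-off itself can be inspected with sklearn's precision_recall_curve. The sketch below uses invented labels and scores; in the notebook, y_test and the churn-class column of y_prob would be passed instead:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and predicted churn probabilities (not the project data)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# One (precision, recall) pair per candidate probability cutoff
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```

Scanning these arrays shows how lowering the cutoff raises recall at the cost of precision, which is the basis for the cost-optimized cutoff mentioned later.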

Feature Importances

In [64]:
ftrs = pd.DataFrame(data = mdl.feature_importances_, index = X.columns, columns = ['importance'])
ftrs.sort_values('importance', ascending = False, inplace = True)
plt.figure(figsize = (14,6))
sns.barplot(data = ftrs, y = ftrs.index, x = 'importance', color = 'lightgreen')
plt.xlabel('Importance'); plt.ylabel('Model Features'); plt.title('Feature Importances'); plt.show()

In general, the feature importances agree with the conclusions of the data exploration for each feature; specifically:

  • The importance of the top 4 features agrees with the respective sections of the exploratory analyses, for age, num_products, balance, and active_member.
  • The importances of credit_score, salary, and tenure seem overestimated compared to what the exploratory analysis of credit_score, salary, and tenure showed. A possible reason is that the impurity-based feature importance method can overestimate the importance of continuous features, or of discrete features with high cardinality. An alternative method, such as Permutation Feature Importance, could also be tried.
  • Conversely, the importances of Germany and Gender (Male) are underestimated compared to the exploratory analysis results for country and gender, and should rank above the credit_score, salary, and tenure importances.
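As a cross-check against the impurity bias noted above, Permutation Feature Importance measures the drop in score when each feature's values are shuffled. A sketch on a toy classification task follows; in the notebook, mdl and the bank feature matrix would replace the toy model and data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data standing in for the bank features (illustrative only):
X_toy, y_toy = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Shuffle each column n_repeats times and record the mean drop in accuracy:
result = permutation_importance(clf, X_toy, y_toy, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print('feature {}: {:.3f} +/- {:.3f}'.format(
        idx, result.importances_mean[idx], result.importances_std[idx]))
```

Unlike impurity-based importance, this method does not systematically favor continuous or high-cardinality features, which could help resolve the credit_score, salary, and tenure rankings.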

Partial Dependence Plots

Partial dependence plots for the most important features are presented below, to visualize the marginal effect of each of these features on the response variable. The vertical lines on the x-axes of the plots indicate the deciles of the feature values.

In [65]:
from sklearn.inspection import partial_dependence, plot_partial_dependence
In [66]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['age'], percentiles = (0,1))
plt.title('age: PDP'); plt.ylim([0, 0.55]); plt.show()
In [67]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['num_products'], percentiles = (0,1))
plt.title('num_products: PDP'); plt.ylim([0, 0.8]); plt.show()
In [68]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['balance'], percentiles = (0,1))
plt.title('balance: PDP'); plt.ylim([0, 0.40]); plt.show()
In [69]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['active_member'], percentiles = (0,1))
plt.title('active_member: PDP'); plt.ylim([0, 0.35]); plt.show()
In [70]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['Germany'], percentiles = (0,1))
plt.title('Germany: PDP'); plt.ylim([0, 0.35]); plt.show()
In [71]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['Male'], percentiles = (0,1))
plt.title('gender: PDP'); plt.ylim([0, 0.35]); plt.show()

Partial dependence plots for the least important features are presented below.

In [72]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['credit_score'], percentiles = (0,1))
plt.title('credit_score: PDP'); plt.ylim([0, 0.5]); plt.show()

The peak of partial dependence for credit scores < 400 corresponds to a negligibly small portion of customers.

In [73]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['salary'], percentiles = (0,1))
plt.title('salary: PDP'); plt.ylim([0, 0.5]); plt.show()
In [74]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['tenure'], percentiles = (0,1))
plt.title('tenure: PDP'); plt.ylim([0, 0.5]); plt.show()
In [75]:
plot_partial_dependence(estimator = mdl, X = X_train, features = ['credit_card'], percentiles = (0,1))
plt.title('credit_card: PDP'); plt.ylim([0, 0.5]); plt.show()

No two-way partial dependence plots are needed for feature combinations, since the features are generally uncorrelated, as shown previously.

Cutoff Probability Optimization

By default, the cutoff probability for an example to be classified as label "1" is 50%. As seen in the Model Training section, the test set yields a certain number of False Negatives (FN) and False Positives (FP), between which there is a trade-off.

A FN means that a customer is predicted as unlikely to leave but eventually leaves. From a business standpoint, churned customers represent a high cost for the bank: the cost of acquiring new customers, the increased need for customer support until new customers become familiar with the products, etc. There is therefore a need to minimize the FN without eliminating them entirely, since that would inflate the FP. Although lower than the FN cost, FP also incur a cost: targeting the customers predicted as likely to leave with ads and marketing promotions, to prevent them from leaving. A cutoff probability optimization is performed below to minimize the bank's expected cost per customer, E[cost] = (FP · cost_FP + FN · cost_FN) / N. The following hypothetical values are assumed:

  • average cost per churned customer = €350,
  • average cost for marketing campaign per targeted customer = €100.
In [76]:
# Find new cutoff for minimum expected cost per customer:
proba = pd.DataFrame(data = mdl.predict_proba(X_test), columns = ['prob_0', 'prob_1'])
cost_fn = 350 ; cost_fp = 100
expected_cost = [] ; f1 = []
for i in np.linspace(0, 100, 101):
    proba['new_class'] = proba['prob_1'].apply(lambda x: 1 if x >= i/100 else 0)
    c_m = confusion_matrix(y_test, proba['new_class'])
    expected_cost.append(c_m[0][1] / c_m.sum() * cost_fp + c_m[1][0] / c_m.sum() * cost_fn)
    f1.append(f1_score(y_test, proba['new_class']))
cut_off = np.argmin(expected_cost) / 100   
    
# Plot expected cost, f1 score:   
plt.figure(figsize = (10, 5))
plt.subplot(1,2,1)
plt.plot(np.linspace(0, 1, 101), expected_cost)
plt.title('Expected Cost Optimization'); plt.xlabel('Cutoff probability for label "1"') 
plt.ylabel('Expected cost per customer')
plt.subplot(1,2,2)
plt.plot(np.linspace(0, 1, 101), f1, label = 'f1 score', color = 'orange') 
plt.title('f1-Score'); plt.xlabel('cutoff probability for label "1"'); plt.ylabel('f1-Score')
plt.show()

# Print cutoff probability, min expected cost, confusion matrix:
print('The optimized Cutoff Probability for Class "1" is {}, \nand the minimum Expected Cost is {} € per customer.'\
      .format(cut_off, min(expected_cost).round(1)))

c_m = pd.DataFrame(data = confusion_matrix(y_test, proba['prob_1'].apply(lambda x: 1 if x >= cut_off else 0)))
c_m.index.names = ['Actual']; c_m.columns.names = ['Predicted w/ New Cutoff']
c_m
The optimized Cutoff Probability for Class "1" is 0.29, 
and the minimum Expected Cost is 32.8 € per customer.
Out[76]:
Predicted w/ New Cutoff 0 1
Actual
0 1995 378
1 173 454

As expected given the assumed ratio of the FP to FN cost per customer, the optimized cutoff probability moved examples from the FN category to the FP category.
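Since the costs of €350 and €100 are hypothetical, it is worth checking how sensitive the optimal cutoff is to them. The sketch below wraps the notebook's cost computation in a helper function and runs it under two cost assumptions; synthetic labels and probabilities stand in for y_test and mdl.predict_proba(X_test):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Helper mirroring the notebook's cost computation (costs are hypothetical):
def optimal_cutoff(y_true, prob_1, cost_fn, cost_fp):
    costs = []
    for t in np.linspace(0, 1, 101):
        pred = (prob_1 >= t).astype(int)
        cm = confusion_matrix(y_true, pred, labels=[0, 1])
        costs.append((cm[0, 1] * cost_fp + cm[1, 0] * cost_fn) / cm.sum())
    return np.argmin(costs) / 100

# Synthetic labels/probabilities standing in for the test set (illustrative only):
rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.2, size=2000)
prob_1 = np.clip(0.35 * y_true + rng.uniform(0, 0.65, size=2000), 0, 1)

# The costlier a FN is relative to a FP, the lower the optimal cutoff tends to be:
print(optimal_cutoff(y_true, prob_1, cost_fn=350, cost_fp=100))
print(optimal_cutoff(y_true, prob_1, cost_fn=700, cost_fp=100))
```

In practice, the bank's finance department should supply cost estimates with uncertainty ranges, and the cutoff should be re-evaluated over that range.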

Conclusions

Recommendations to Reduce Churn

  • num_products: Going from 1 product to 2 appears to be a "turning point" for retention. Customers with 1 product (almost 50% of all) have a far higher churn probability than customers with 2 products (as shown here and here). A likely explanation is that customers invested in many products find it harder to transfer them all to another bank. The segment of customers currently holding 1 product presents a big opportunity; Marketing should therefore target these customers and try to sell them additional products.
  • active_member: Customer participation and engagement benefit retention. Inactive customers are almost 50% of all, and have a considerably higher churn probability than active ones (as shown here and here). The Marketing department should try to increase engagement through emails about offers, reminders, and newsletters. The Product department should ensure that the customer experience (in-person, phone, website, mobile app) is at a high level. A/B testing experiments could be used to increase engagement on the website or the mobile app, e.g. the percentage of users who make at least one money transfer per week.
  • age: The user segment above 40 years old does not perform well for customer retention (as shown here and here). This segment constitutes almost 35% of the customers and has a churn probability above the overall average, which also rises sharply with age. This may be because older customers are more demanding regarding their financial targets, but also less familiar with the website or mobile app functions. The bank should provide packages customized to the needs of these customers, e.g. related to retirement investments. Also, User Experience Research (UER) could shed light on the usability of the bank's website and mobile app for these ages.
  • country: Germany is not performing well; it constitutes almost 25% of the customers and has a considerably higher churn probability than the other countries (as seen here and here). This may be due to local banking system conditions and competition. The bank needs to adjust its products and services to the local culture, since there is a lot of room for growth in this customer segment.
  • balance: Almost 1/3 of customers have zero balance, and these have a considerably lower churn probability than the remaining customers (as shown here and here). No information is currently available on how many and what type of products these customers have, whether they have an account at all, or whether they are simply categorized as zero-balance. In a real project, this information should be requested from the bank by the data science team. It would then be possible to further segment the customers and make specific suggestions to the Marketing department on how to reduce churn probability for the customers with non-zero balances (2/3 of all).
  • gender: Female customers have a considerably higher churn probability than male customers (as shown here and here). The bank needs to better address the needs of its female customers, using, for example, feedback from user surveys.

Cutoff probability to minimize bank costs

To minimize the bank's overall expected cost per customer, an optimal cutoff probability for the classifier was found previously. This new cutoff point was based on assumed hypothetical values for the bank's average cost per False Positive (FP) and per False Negative (FN) case.

Further Considerations

  • If available, certain additional features could improve the precision and recall of the classifier: customer marital status/dependents, occupation, the types of products already held (e.g. loans), whether the customer has ever filed a complaint with the bank, and whether the customer has had any unfulfilled requests (e.g. a credit card application rejected by the bank).
  • The distributions of certain features do not follow expected shapes: estimated salary is almost uniform instead of right-skewed or half-normal, and tenure is also quite uniform and limited to 10 years (perhaps it is a newer bank). Additional information would be needed from the bank on whether this dataset was intended to be representative of all customers or only of a specific subset. If it was meant to be representative of all customers, the bank's process for selecting customers for the dataset should be checked for bugs.
  • The overall distribution of balance shows almost 1/3 of the customers with zero balance, whereas Germany is the only country without zero-balance customers. It would be useful to have more information regarding the reason for the high frequency of zero balances, and how many and what types of products these customers have.
  • Different types of algorithms could also be tried for classification (e.g. Logistic Regression) or for obtaining feature importances (e.g. Permutation Feature Importance).
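As a starting point for the Logistic Regression alternative mentioned above, a sketch on toy data follows; in the notebook, the prepared bank feature matrix and the existing train/test split would replace the synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the prepared bank features (illustrative only):
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   n_informative=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3,
                                      random_state=0, stratify=y_toy)

# Logistic regression benefits from feature scaling, hence the pipeline:
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(Xtr, ytr)
acc = accuracy_score(yte, logreg.predict(Xte))
print('test accuracy: {:.3f}'.format(acc))
```

A linear model also yields coefficients that are directly interpretable as log-odds effects, complementing the tree-based feature importances.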


-------------------- / END OF NOTEBOOK, THANK YOU! / -------------------- © 2021 Michail Mavrogiannis

You are welcome to visit My LinkedIn profile and see my other projects in My GitHub profile.

Michail Mavrogiannis